Speech Emotion Recognition for Market Research¶

process.jpg

1. Summary¶

In market research, obtaining reliable and trustworthy survey data is a challenge for surveyors, because respondents may give subjective or dishonest answers due to conflicts of interest. This project proceeds through audio data preprocessing, exploratory data analysis, feature extraction, modelling, model evaluation, and conclusion. It can help various departments, including marketing and market research, customer service, Human Resources, recruitment and job interviews, and employee training and development. It can also be implemented in many industries, such as telecommunications, internet providers, banking and fintech, home and household appliances, customer service, human resources and outsourcing, travel and hotels, training and education, and call centers.


2. Preface¶

2.1 Background¶

Communication is defined as the process of understanding and sharing meaning (Pearson & Nelson, 2000). In business, communication is a way to understand interlocutors such as customers, potential customers, conversation partners, and vendors. Especially in market research within the marketing department, a company must gather as much valuable information from customers as possible, so that it can grow by understanding its customers' and potential customers' needs and wants.

The company uses the Value Proposition Canvas (VPC) framework for product development, so that it can generate new products that reflect customers' needs. This framework helps companies and entrepreneurs solve problems and satisfy customer needs by discovering customers' pains and identifying the jobs customers need to get done. Building such a customer job list therefore requires qualitative or quantitative research.

Using the company's resources and capabilities, this valuable information can support market research, campaign analysis, product development, process improvement, service improvement, customer satisfaction, product and service evaluation, customer behavior analysis, and so on.


2.2 Business Issue Exploration¶

There is a range of information a company can gather: needs, wants, complaints, reviews, feedback on products, responses to new campaigns, customer sentiment, and so on. Most of the company's data comes from questionnaires, observation, interviews, and social media. Survey data therefore comes in two types: audio data and text data.

The problem is that collected answers can be subjective or dishonest due to conflicts of interest toward the brand or the surveyor. This weakens the company's ability to conduct market research and to understand its customers well, leading to misleading information, higher customer churn, miscommunication, poor customer experience, and so on.


2.3 Project Idea¶

To minimize misleading information when the company conducts market research, this project develops a classification model that classifies a person's emotion toward a product, a service, or a specific campaign. The emotions the model classifies are anger, happiness, neutral, and sadness.


2.4 Problem Scope¶

This project only processes audio data to classify human emotions. It uses the CREMA-D data set, taken from https://github.com/CheyneyComputerScience/CREMA-D . CREMA-D is a data set of 7,442 original clips from 91 actors: 48 male and 43 female, aged 20 to 74, from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences, presented using one of six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four emotion levels (Low, Medium, High, and Unspecified). To build a prototype with limited resources, however, this project proceeds with only four emotions as target classes and a small number of clips (1,183).

The data set fits the business need of this project because:

  • The data set represents the diversity of human emotions.
  • The size of data set has a sufficiently large number of samples.
  • The CREMA-D and TESS data sets come with metadata that provides additional information about expressed emotions, speaker identities, gender, and other details relevant to an emotional speech recognition project.
  • These data sets also include variations in language and pronunciation accents, which is valuable for building a model that is more robust at recognizing emotional speech from diverse linguistic backgrounds.

2.5 Desired Output¶

The project output is a dashboard with a classification model that can classify several emotions (anger, happiness, sadness, and neutral) in real time. The dashboard provides the following features:

  • Human emotion classification results for audio & video file input.
  • Human emotion classification results in real time.
  • Time ranges for each emotion result, for both real-time and audio & video file input.
  • Face emotion recognition results in real time.

2.6 Business Impact¶

Building this emotional speech recognition project can therefore help the company grow, as follows:

  • In marketing and market research departments, emotional speech recognition can be used to analyze customer sentiments, feelings, and emotional responses. This model helps companies understand how customers emotionally respond towards a product, a service, or specific campaign and make decisions based on the analysis. The company can gather customer voice data through various channels, such as recorded customer service calls, one-on-one customer interview, focus group discussion recording, or voice recordings uploaded by customers in the form of testimonials or product reviews in social media.

  • In the customer service department, emotional speech recognition can be used to understand the emotions and feelings of customers during interactions with customer service agents. As a result, companies can respond better and provide appropriate solutions to enhance customer satisfaction, customer experience, and customer loyalty with strategic communication improvement.

  • In the Human Resources department, emotional speech recognition can be used to monitor and analyze employees’ emotional expressions during meetings, presentations, or team interactions. This information can help managers or HR teams to understand employees’ satisfaction levels, anxiety, or happiness so that the HR teams can take appropriate actions to improve their well-being.

  • In recruitment and job interviews, emotional speech recognition can assist companies in analyzing the speech and emotional responses of candidates during interviews, providing additional insight into their personality, interpersonal skills, and cultural fit with the company.

  • In employee training and development, emotional speech recognition can be used to provide real-time feedback and evaluation on how employees communicate emotionally. This can help improve communication skills, emotional management, and interpersonal interactions.



2.7 User Target & Benefits¶

The target users of this project are the marketing and market research departments, the customer service department, the Human Resources department, recruitment and interview teams, and employee training and development teams. The benefits for each of these user groups are described in Section 2.6 above.


2.8 Industry Implementation¶

This project can be implemented in various industries, such as telecommunications, internet providers, banking and fintech, home and household appliances, customer service, human resources and outsourcing, travel and hotels, training and education, and call centers.


3. Library¶

In [16]:
# Base library
import os
import math

# Exploratory data
import pandas as pd
import numpy as np
from collections import Counter
%matplotlib inline
import matplotlib.pyplot as plt
import librosa
import librosa.display

# Playing the audio
from IPython.display import display
import IPython.display as ipd

4. Audio Data Preprocessing¶

4.1 Read Audio Files¶

In [221]:
path = 'data_input/'  

# Fetches all filenames in the folder
files = os.listdir(path)


# Build a dictionary that stores each speaker's demographics based on the filename prefix
conditions = {}
for file in files:
    name_parts = file.split('_')
    name_1 = name_parts[0]
    name_2 = name_parts[2]
    name_3 = name_parts[3]
    if name_1 == '1001':
        conditions[file] = {'Age': 51, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1002':
        conditions[file] = {'Age': 21, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1003':
        conditions[file] = {'Age': 21, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1004':
        conditions[file] = {'Age': 42, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1005':
        conditions[file] = {'Age': 29, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1006':
        conditions[file] = {'Age': 58, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1007':
        conditions[file] = {'Age': 38, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1008':
        conditions[file] = {'Age': 46, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1009':
        conditions[file] = {'Age': 24, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1010':
        conditions[file] = {'Age': 27, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1011':
        conditions[file] = {'Age': 32, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1012':
        conditions[file] = {'Age': 23, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1013':
        conditions[file] = {'Age': 22, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1014':
        conditions[file] = {'Age': 24, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1015':
        conditions[file] = {'Age': 32, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1016':
        conditions[file] = {'Age': 61, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1017':
        conditions[file] = {'Age': 42, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1018':
        conditions[file] = {'Age': 25, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1019':
        conditions[file] = {'Age': 29, 'Sex': 'Male', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1020':
        conditions[file] = {'Age': 61, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1021':
        conditions[file] = {'Age': 30, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1022':
        conditions[file] = {'Age': 22, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1023':
        conditions[file] = {'Age': 22, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1024':
        conditions[file] = {'Age': 59, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1025':
        conditions[file] = {'Age': 48, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1026':
        conditions[file] = {'Age': 33, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1027':
        conditions[file] = {'Age': 44, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1028':
        conditions[file] = {'Age': 57, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1029':
        conditions[file] = {'Age': 33, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1030':
        conditions[file] = {'Age': 42, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1031':
        conditions[file] = {'Age': 31, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1032':
        conditions[file] = {'Age': 30, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1033':
        conditions[file] = {'Age': 31, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1034':
        conditions[file] = {'Age': 74, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1035':
        conditions[file] = {'Age': 48, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1036':
        conditions[file] = {'Age': 49, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1037':
        conditions[file] = {'Age': 45, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1038':
        conditions[file] = {'Age': 21, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1039':
        conditions[file] = {'Age': 51, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1040':
        conditions[file] = {'Age': 42, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1041':
        conditions[file] = {'Age': 42, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1042':
        conditions[file] = {'Age': 37, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1043':
        conditions[file] = {'Age': 25, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1044':
        conditions[file] = {'Age': 40, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1045':
        conditions[file] = {'Age': 22, 'Sex': 'Male', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1046':
        conditions[file] = {'Age': 22, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1047':
        conditions[file] = {'Age': 22, 'Sex': 'Female', 'Race': 'Unknown', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1048':
        conditions[file] = {'Age': 38, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1049':
        conditions[file] = {'Age': 25, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1050':
        conditions[file] = {'Age': 62, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1051':
        conditions[file] = {'Age': 56, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1052':
        conditions[file] = {'Age': 33, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1053':
        conditions[file] = {'Age': 35, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1054':
        conditions[file] = {'Age': 36, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1055':
        conditions[file] = {'Age': 57, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1056':
        conditions[file] = {'Age': 52, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1057':
        conditions[file] = {'Age': 25, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1058':
        conditions[file] = {'Age': 36, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1059':
        conditions[file] = {'Age': 21, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1060':
        conditions[file] = {'Age': 28, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1061':
        conditions[file] = {'Age': 51, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1062':
        conditions[file] = {'Age': 56, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1063':
        conditions[file] = {'Age': 33, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1064':
        conditions[file] = {'Age': 53, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1065':
        conditions[file] = {'Age': 38, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1066':
        conditions[file] = {'Age': 25, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1067':
        conditions[file] = {'Age': 66, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1068':
        conditions[file] = {'Age': 34, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1069':
        conditions[file] = {'Age': 27, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1070':
        conditions[file] = {'Age': 25, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1071':
        conditions[file] = {'Age': 41, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1072':
        conditions[file] = {'Age': 33, 'Sex': 'Female', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1073':
        conditions[file] = {'Age': 24, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1074':
        conditions[file] = {'Age': 31, 'Sex': 'Female', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1075':
        conditions[file] = {'Age': 40, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1076':
        conditions[file] = {'Age': 25, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1077':
        conditions[file] = {'Age': 20, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1078':
        conditions[file] = {'Age': 21, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1079':
        conditions[file] = {'Age': 21, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Hispanic'}
    elif name_1 == '1080':
        conditions[file] = {'Age': 21, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1081':
        conditions[file] = {'Age': 30, 'Sex': 'Male', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1082':
        conditions[file] = {'Age': 20, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1083':
        conditions[file] = {'Age': 45, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1084':
        conditions[file] = {'Age': 46, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1085':
        conditions[file] = {'Age': 34, 'Sex': 'Male', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1086':
        conditions[file] = {'Age': 33, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1087':
        conditions[file] = {'Age': 62, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1088':
        conditions[file] = {'Age': 23, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1089':
        conditions[file] = {'Age': 24, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1090':
        conditions[file] = {'Age': 50, 'Sex': 'Male', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
    elif name_1 == '1091':
        conditions[file] = {'Age': 29, 'Sex': 'Female', 'Race': 'Asian', 'Ethnicity': 'Not_Hispanic'}
      
    # Map the emotion code from the filename (CREMA-D uses 'HAP' for happiness)
    if name_2 == 'ANG':
        emotion = 'Anger'
    elif name_2 == 'HAP':
        emotion = 'Happiness'
    elif name_2 == 'SAD':
        emotion = 'Sadness'
    else:
        emotion = 'Neutral'

    # Map the emotion-level code into a separate variable so it does not
    # overwrite the emotion label
    if name_3 == 'HI':
        level = 'High'
    elif name_3 == 'MD':
        level = 'Medium'
    elif name_3 == 'LO':
        level = 'Low'
    else:
        level = 'Unspecified'

# Convert the dictionary into a DataFrame
df = pd.DataFrame.from_dict(conditions, orient='index')
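The long if/elif chain above can be written more compactly as a dictionary lookup keyed by actor ID. A minimal sketch (only three of the 91 actors shown, with values copied from the chain above):

```python
# Demographic records keyed by actor ID; only three illustrative entries shown,
# copied from the if/elif chain above. The full table would hold all 91 actors.
demographics = {
    '1001': {'Age': 51, 'Sex': 'Male', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'},
    '1002': {'Age': 21, 'Sex': 'Female', 'Race': 'Caucasian', 'Ethnicity': 'Not_Hispanic'},
    '1005': {'Age': 29, 'Sex': 'Male', 'Race': 'African_American', 'Ethnicity': 'Not_Hispanic'},
}

def speaker_info(filename):
    """Look up the demographics for a CREMA-D clip such as '1001_IEO_ANG_HI.wav'."""
    actor_id = filename.split('_')[0]
    return demographics.get(actor_id)  # None for an unknown actor ID

print(speaker_info('1001_IEO_ANG_HI.wav'))
```

With the full table in place, the loop body reduces to a single lookup per file instead of 91 branches.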
In [222]:
audio = df.reset_index()
audio.columns = ['File_Name','Age','Sex','Race','Ethnicity']


# Sort by filename index-0, index-2, and index-3
audio = audio.sort_values(by=['File_Name'])

# Create new columns from splitting the File_Name column, for sorting purposes
audio[['col0', 'col2', 'col3']] = audio['File_Name'].str.split('_', expand=True)[[0, 2, 3]]
audio = audio.sort_values(by=['col0', 'col2', 'col3'])

audio
Out[222]:
File_Name Age Sex Race Ethnicity col0 col2 col3
0 1001_IEO_ANG_HI.wav 51 Male Caucasian Not_Hispanic 1001 ANG HI.wav
1 1001_IEO_ANG_LO.wav 51 Male Caucasian Not_Hispanic 1001 ANG LO.wav
2 1001_IEO_ANG_MD.wav 51 Male Caucasian Not_Hispanic 1001 ANG MD.wav
12 1001_WSI_ANG_XX.wav 51 Male Caucasian Not_Hispanic 1001 ANG XX.wav
3 1001_IEO_HAP_HI.wav 51 Male Caucasian Not_Hispanic 1001 HAP HI.wav
... ... ... ... ... ... ... ... ...
1181 1091_IWW_NEU_XX.wav 29 Female Asian Not_Hispanic 1091 NEU XX.wav
1176 1091_IEO_SAD_HI.wav 29 Female Asian Not_Hispanic 1091 SAD HI.wav
1177 1091_IEO_SAD_LO.wav 29 Female Asian Not_Hispanic 1091 SAD LO.wav
1178 1091_IEO_SAD_MD.wav 29 Female Asian Not_Hispanic 1091 SAD MD.wav
1179 1091_IOM_SAD_XX.wav 29 Female Asian Not_Hispanic 1091 SAD XX.wav

1183 rows × 8 columns

In [223]:
# Drop unnecessary columns
audio = audio.drop(columns=['File_Name', 'col0'])

# Split the col3 column to separate the emotion level from the file extension
audio[['Emotion_Lvl', 'Format']] = audio['col3'].str.split('.', expand=True)[[0, 1]]

# Drop unnecessary columns
audio = audio.drop(columns=['col3','Format'])

# Rename col2 column
audio.rename(columns = {'col2' : 'Emotion'}, inplace = True)

# Reset Index
audio = audio.reset_index()
audio = audio.drop(columns=['index'])
audio
Out[223]:
Age Sex Race Ethnicity Emotion Emotion_Lvl
0 51 Male Caucasian Not_Hispanic ANG HI
1 51 Male Caucasian Not_Hispanic ANG LO
2 51 Male Caucasian Not_Hispanic ANG MD
3 51 Male Caucasian Not_Hispanic ANG XX
4 51 Male Caucasian Not_Hispanic HAP HI
... ... ... ... ... ... ...
1178 29 Female Asian Not_Hispanic NEU XX
1179 29 Female Asian Not_Hispanic SAD HI
1180 29 Female Asian Not_Hispanic SAD LO
1181 29 Female Asian Not_Hispanic SAD MD
1182 29 Female Asian Not_Hispanic SAD XX

1183 rows × 6 columns

4.2 Change The Data Types¶

In [224]:
audio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1183 entries, 0 to 1182
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Age          1183 non-null   int64 
 1   Sex          1183 non-null   object
 2   Race         1183 non-null   object
 3   Ethnicity    1183 non-null   object
 4   Emotion      1183 non-null   object
 5   Emotion_Lvl  1183 non-null   object
dtypes: int64(1), object(5)
memory usage: 55.6+ KB
In [225]:
# Change to category data type
audio[['Sex','Race','Ethnicity','Emotion','Emotion_Lvl']] = audio[['Sex','Race','Ethnicity','Emotion','Emotion_Lvl']].astype('category')
audio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1183 entries, 0 to 1182
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Age          1183 non-null   int64   
 1   Sex          1183 non-null   category
 2   Race         1183 non-null   category
 3   Ethnicity    1183 non-null   category
 4   Emotion      1183 non-null   category
 5   Emotion_Lvl  1183 non-null   category
dtypes: category(5), int64(1)
memory usage: 16.0 KB

4.3 Data Description¶

In [234]:
audio.nunique()
Out[234]:
Age            38
Sex             2
Race            4
Ethnicity       2
Emotion         4
Emotion_Lvl     4
dtype: int64

Insight:

The nunique() function result above shows that the Age column has 38 unique values, the Sex column has 2, the Race column has 4, the Ethnicity column has 2, the Emotion column has 4, and the Emotion_Lvl column has 4.

  • Age
In [226]:
audio['Age'].unique()
Out[226]:
array([51, 21, 42, 29, 58, 38, 46, 24, 27, 32, 23, 22, 61, 25, 30, 59, 48,
       33, 44, 57, 31, 74, 49, 45, 37, 40, 62, 56, 35, 36, 52, 28, 53, 66,
       34, 41, 20, 50], dtype=int64)
In [286]:
age=pd.crosstab(index=audio['Age'],
           columns='Total')
age.sort_values(by=['Total'],ascending=False).T
Out[286]:
Age 21 25 22 33 42 24 29 30 31 51 ... 41 52 50 49 44 37 35 28 74 58
col_0
Total 91 91 78 78 65 51 40 39 39 39 ... 13 13 13 13 13 13 13 13 13 12

1 rows × 38 columns

Insight:

  • The unique() function result above shows that the minimum age is 20 and the maximum age is 74.
  • The crosstab() result above shows that ages 21 and 25 share the first position with 91 clips each, followed by ages 22 and 33 with 78 clips each. The last position is age 58, with 12 clips.
  • Sex
In [227]:
audio['Sex'].unique()
Out[227]:
['Male', 'Female']
Categories (2, object): ['Female', 'Male']
In [289]:
sex=pd.crosstab(index=audio['Sex'],
           columns='Total')
sex.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
Out[289]:
<Axes: xlabel='Sex'>

Insight:

  • The unique() function result above shows that the Sex column has two values: Male and Female.
  • Male clips are more numerous, with a total of 625.
  • Female clips total 558.
  • Race
In [228]:
audio['Race'].unique()
Out[228]:
['Caucasian', 'African_American', 'Asian', 'Unknown']
Categories (4, object): ['African_American', 'Asian', 'Caucasian', 'Unknown']
In [266]:
race=pd.crosstab(index=audio['Race'],
           columns='Total')

race.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
Out[266]:
<Axes: xlabel='Race'>

Insight:

  • The unique() function result above shows that the Race column consists of Caucasian, African_American, Asian, and Unknown.
  • The crosstab() plot above shows that Caucasian is the largest group, with more than 700 clips, while Asian and Unknown are the smallest, each below 100 clips.
  • Ethnicity
In [230]:
audio['Ethnicity'].unique()
Out[230]:
['Not_Hispanic', 'Hispanic']
Categories (2, object): ['Hispanic', 'Not_Hispanic']
In [292]:
eth=pd.crosstab(index=audio['Ethnicity'],
           columns='Total')
eth.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
Out[292]:
<Axes: xlabel='Ethnicity'>

Insight:

  • The unique() function result above shows that the Ethnicity column consists of Not_Hispanic and Hispanic.
  • Not_Hispanic Ethnicity has a total of 1053.
  • Hispanic Ethnicity has a total of 130.
  • Emotion
In [231]:
audio['Emotion'].unique()
Out[231]:
['ANG', 'HAP', 'NEU', 'SAD']
Categories (4, object): ['ANG', 'HAP', 'NEU', 'SAD']
In [294]:
emo=pd.crosstab(index=audio['Emotion'],
           columns='Total')
emo.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
Out[294]:
<Axes: xlabel='Emotion'>

Insight:

  • The unique() function result above shows that the Emotion column consists of ANG (Anger), HAP (Happiness), NEU (Neutral), and SAD (Sadness).
  • Anger, happiness, and sadness each have a total of 364 clips.
  • The neutral emotion has a total of 91 clips.
  • Emotion_Lvl
In [232]:
audio['Emotion_Lvl'].unique()
Out[232]:
['HI', 'LO', 'MD', 'XX']
Categories (4, object): ['HI', 'LO', 'MD', 'XX']
In [296]:
lvl=pd.crosstab(index=audio['Emotion_Lvl'],
           columns='Total')
lvl.sort_values(by=['Total'],ascending=False).plot(kind='bar',rot=0)
Out[296]:
<Axes: xlabel='Emotion_Lvl'>

Insight:

  • The unique() function result above shows that there are four emotion levels: HI (High), LO (Low), MD (Medium), and XX (Unspecified).
  • The High, Low, and Medium levels each have a total of 273 clips.
  • The unspecified emotion level has a total of 364 clips.

4.4 Extracting Data¶

4.4.1 Amplitude Envelope¶

The amplitude envelope is a curve that represents the change in amplitude of an audio signal over time. This envelope provides information about the dynamics of the audio signal and can assist in the analysis and extraction of sound features that are useful in various applications such as speech recognition, music analysis, and other audio processing.
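As a minimal sketch (not part of the original notebook), the amplitude envelope can be approximated frame by frame as the maximum absolute amplitude within each frame; the frame and hop lengths below are illustrative choices, not values fixed by this project:

```python
import numpy as np

def amplitude_envelope(signal, frame_length=1024, hop_length=512):
    """Maximum absolute amplitude within each frame of the signal."""
    return np.array([
        np.max(np.abs(signal[start:start + frame_length]))
        for start in range(0, len(signal), hop_length)
    ])

# Synthetic example: a 440 Hz tone whose volume fades out linearly over 1 second
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t) * np.linspace(1, 0, sr)

env = amplitude_envelope(tone)
# The envelope decays from near 1 toward 0, mirroring the fade-out
print(env[0], env[-1])
```

Plotting `env` against frame times gives the outline that sits on top of the waveforms shown below.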

A. Anger Audio¶

High Emotion¶
In [388]:
# Display Anger with High Emotion Level Audio Player - 1001_IEO_ANG_HI.wav
ipd.Audio("data_input/1001_IEO_ANG_HI.wav")
Out[388]:
Low Emotion¶
In [389]:
# Display Anger with Low Emotion Level Audio Player - 1088_IEO_ANG_LO.wav
ipd.Audio("data_input/1088_IEO_ANG_LO.wav")
Out[389]:
Medium Emotion¶
In [390]:
# Display Anger with Medium Emotion Level Audio Player - 1018_IEO_ANG_MD.wav
ipd.Audio("data_input/1018_IEO_ANG_MD.wav")
Out[390]:
Your browser does not support the audio element.
Unspecified Emotion¶
In [391]:
# Display Anger with Unspecified Emotion Level Audio Player - 1019_MTI_ANG_XX.wav
ipd.Audio("data_input/1019_MTI_ANG_XX.wav")
Out[391]:
Your browser does not support the audio element.
In [21]:
FIG_SIZE_L = (10,15)
PATH_L = "data_input/"
files =  ["1001_IEO_ANG_HI.wav", "1088_IEO_ANG_LO.wav", "1018_IEO_ANG_MD.wav", "1019_MTI_ANG_XX.wav"]

for item in files:
    FILE_PATH_L = PATH_L + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_L, sr=44100)
    
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • The strength or intensity of the audio signal for the anger emotion, across emotion levels, shows many peak amplitude points (sometimes above 0.5) with small distances between peaks, over a longer duration.

For the anger emotion across emotion levels, the amplitude envelope is used to extract features related to the strength and intensity of the sound, which can then serve as features in machine learning models to classify anger at its various levels. Hence, by treating amplitude as the strength and intensity of the voice, the company can identify customers who dislike or are unsatisfied with its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.

B. Happiness Audio¶

High Emotion¶
In [113]:
# Display Happiness with High Emotion Level Audio Player - 1090_IEO_HAP_HI.wav
ipd.Audio("data_input/1090_IEO_HAP_HI.wav")
Out[113]:
Your browser does not support the audio element.
Low Emotion¶
In [112]:
# Display Happiness with Low Emotion Level Audio Player - 1065_IEO_HAP_LO.wav
ipd.Audio("data_input/1065_IEO_HAP_LO.wav")
Out[112]:
Your browser does not support the audio element.
Medium Emotion¶
In [24]:
# Display Happiness with Medium Emotion Level Audio Player - 1044_IEO_HAP_MD.wav
ipd.Audio("data_input/1044_IEO_HAP_MD.wav")
Out[24]:
Your browser does not support the audio element.
Unspecified Emotion¶
In [25]:
# Display Happiness with Unspecified Emotion Level Audio Player - 1029_IWL_HAP_XX.wav
ipd.Audio("data_input/1029_IWL_HAP_XX.wav")
Out[25]:
Your browser does not support the audio element.
In [26]:
FIG_SIZE_H = (10,15)
PATH_H = "data_input/"
files_H =  ["1090_IEO_HAP_HI.wav", "1065_IEO_HAP_LO.wav", "1044_IEO_HAP_MD.wav", "1029_IWL_HAP_XX.wav"]

for item in files_H:
    FILE_PATH_H = PATH_H + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_H, sr=44100)
    
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • The strength or intensity of the audio signal for the happiness emotion, across emotion levels, shows a moderate number of peak amplitude points (below 0.5) over a moderate duration.

For the happiness emotion across emotion levels, the amplitude envelope is used to extract features related to the strength and intensity of the sound, which can then serve as features in machine learning models to classify happiness at its various levels. Hence, by treating amplitude as the strength and intensity of the voice, the company can identify customers who love or are satisfied with its product, service, specific campaign, or questionnaire question when conducting market research with this machine learning model.

C. Sadness Audio¶

High Emotion¶
In [129]:
# Display Sadness with High Emotion Level Audio Player - 1054_IEO_SAD_HI.wav
ipd.Audio("data_input/1054_IEO_SAD_HI.wav")
Out[129]:
Your browser does not support the audio element.
Low Emotion¶
In [131]:
# Display Sadness with Low Emotion Level Audio Player - 1055_IEO_SAD_LO.wav
ipd.Audio("data_input/1055_IEO_SAD_LO.wav")
Out[131]:
Your browser does not support the audio element.
Medium Emotion¶
In [130]:
# Display Sadness with Medium Emotion Level Audio Player - 1043_IEO_SAD_MD.wav
ipd.Audio("data_input/1043_IEO_SAD_MD.wav")
Out[130]:
Your browser does not support the audio element.
Unspecified Emotion¶
In [32]:
# Display Sadness with Unspecified Emotion Level Audio Player - 1035_IOM_SAD_XX.wav
ipd.Audio("data_input/1035_IOM_SAD_XX.wav")
Out[32]:
Your browser does not support the audio element.
In [33]:
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S =  ["1054_IEO_SAD_HI.wav", "1055_IEO_SAD_LO.wav", "1043_IEO_SAD_MD.wav", "1035_IOM_SAD_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_S, sr=44100)
    
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • The strength or intensity of the audio signal for the sadness emotion, across emotion levels, shows very small, varying amplitude values (below 0.5) over a short time.

For the sadness emotion across emotion levels, the amplitude envelope is used to extract features related to the strength and intensity of the sound, which can then serve as features in machine learning models to classify sadness at its various levels. Hence, by treating amplitude as the strength and intensity of the voice, the company can identify customers who have certain pains or jobs to be done and who are sad about its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.

D. Neutral Audio¶

Unspecified Emotion 1¶
In [136]:
# Display Neutral with Unspecified Emotion Level Audio Player - 1091_IWW_NEU_XX.wav
ipd.Audio("data_input/1091_IWW_NEU_XX.wav")
Out[136]:
Your browser does not support the audio element.
Unspecified Emotion 2¶
In [137]:
# Display Neutral with Unspecified Emotion Level Audio Player - 1050_IWW_NEU_XX.wav
ipd.Audio("data_input/1050_IWW_NEU_XX.wav")
Out[137]:
Your browser does not support the audio element.
Unspecified Emotion 3¶
In [104]:
# Display Neutral with Unspecified Emotion Level Audio Player - 1017_IWW_NEU_XX.wav
ipd.Audio("data_input/1017_IWW_NEU_XX.wav")
Out[104]:
Your browser does not support the audio element.
Unspecified Emotion 4¶
In [103]:
# Display Neutral with Unspecified Emotion Level Audio Player - 1001_IWW_NEU_XX.wav
ipd.Audio("data_input/1001_IWW_NEU_XX.wav")
Out[103]:
Your browser does not support the audio element.
In [34]:
FIG_SIZE_N = (10,15)
PATH_N = "data_input/"
files_N =  ["1091_IWW_NEU_XX.wav", "1050_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav", "1001_IWW_NEU_XX.wav"]

for item in files_N:
    FILE_PATH_N = PATH_N + item
    # load audio file with Librosa
    signal, sample_rate = librosa.load(FILE_PATH_N, sr=44100)
    
    # display waveform
    plt.figure(figsize=(12, 4))
    librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.yticks(np.arange(-1, 1.25, 0.5))
    plt.title(f"Waveform ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • The strength or intensity of the audio signal for the neutral emotion (unspecified level) shows varying amplitude values (below 0.5) over a short time (below 2 seconds).

For the neutral emotion (unspecified level), the amplitude envelope is used to extract features related to the strength and intensity of the sound, which can then serve as features in machine learning models to classify the neutral emotion in the voice. Hence, by treating amplitude as the strength and intensity of the voice, the company can identify customers who feel neutral toward its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.


5. EDA - Feature Extraction¶

5.1 PROSODIC¶

5.1.1 Pitch¶

The pitch of a sound is determined by the frequency of the vibrations produced by the vocal cords in the larynx, which is then translated into audible sound waves. Measured in Hertz (Hz), pitch refers to the high or low tone of a sound and can be used to differentiate between different sounds, including human speech. This information can provide insights into the emotions, intonation, and even the identity of the speaker. For example, the intensity of emotions expressed in speech can be conveyed through pitch, with high pitch conveying excitement or anxiety and low pitch indicating boredom or depression.

In [101]:
# Define directory path and initialize empty dataframe
dir_path = 'data_input/'
df_pitch = pd.DataFrame(columns=['filename', 'pitch_mean', 'pitch_std', 'pitch_min', 'pitch_max'])

# Loop through each file in the directory
for filename in os.listdir(dir_path):
    # Check if file name starts with a number between 1001 and 1091
    if filename.startswith(tuple([str(i) for i in range(1001, 1092)])):
        # Extract the number from the file name
        number = int(filename.split('_')[0])
        # Load audio file and extract pitch features
        audio_file = os.path.join(dir_path, filename)
        y, sr = librosa.load(audio_file, sr=44100)
        pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
        pitch_mean = np.mean(pitches)
        pitch_std = np.std(pitches)
        pitch_min = np.min(pitches)
        pitch_max = np.max(pitches)

        # Append feature values to the dataframe
        df_pitch = pd.concat([df_pitch, pd.DataFrame({
            'filename': [filename],
            'number': [number],
            'pitch_mean': [pitch_mean],
            'pitch_std': [pitch_std],
            'pitch_min': [pitch_min],
            'pitch_max': [pitch_max]
        })], ignore_index=True)

# Assign the dataframe to a variable
pitch_features = df_pitch


# Sort by filename index-0, index-2, and index-3
pitch_features = pitch_features.sort_values(by=['filename'])
pitch_features[['col0', 'col2', 'col3']] = pitch_features['filename'].str.split('_', expand=True)[[0, 2, 3]]
pitch_features = pitch_features.sort_values(by=['col0', 'col2', 'col3'])

# Drop unnecessary columns
pitch_features = pitch_features.drop(columns=['col0', 'col2', 'col3'])

pitch_features = pitch_features.reset_index(drop=True)
pitch_features=pitch_features.drop(['number'],axis=1)
pitch_features
Out[101]:
filename pitch_mean pitch_std pitch_min pitch_max
0 1001_IEO_ANG_HI.wav 14.328611 169.545547 0.0 3994.187988
1 1001_IEO_ANG_LO.wav 9.142864 122.549446 0.0 3994.366699
2 1001_IEO_ANG_MD.wav 9.704618 130.087921 0.0 3993.887939
3 1001_WSI_ANG_XX.wav 15.859626 189.066589 0.0 3992.139404
4 1001_IEO_HAP_HI.wav 16.429127 184.560547 0.0 3965.526123
... ... ... ... ... ...
1178 1091_IWW_NEU_XX.wav 4.151556 80.656540 0.0 3989.319824
1179 1091_IEO_SAD_HI.wav 3.047120 55.561882 0.0 3826.993896
1180 1091_IEO_SAD_LO.wav 2.854729 54.097889 0.0 3989.583984
1181 1091_IEO_SAD_MD.wav 3.207180 57.784866 0.0 3935.103516
1182 1091_IOM_SAD_XX.wav 1.456717 31.115238 0.0 3892.249023

1183 rows × 5 columns

In [135]:
a = pitch_features.iloc[[0,4,9,1,5,10,2,6,11,3,7,12]]
b = pitch_features.iloc[[8,21,34,47]]
pitch_sample = pd.concat([a,b])
pitch_sample = pitch_sample.reset_index(drop=True)
pitch_sample
Out[135]:
filename pitch_mean pitch_std pitch_min pitch_max
0 1001_IEO_ANG_HI.wav 14.328611 169.545547 0.0 3994.187988
1 1001_IEO_HAP_HI.wav 16.429127 184.560547 0.0 3965.526123
2 1001_IEO_SAD_HI.wav 8.878007 121.977211 0.0 3896.472900
3 1001_IEO_ANG_LO.wav 9.142864 122.549446 0.0 3994.366699
4 1001_IEO_HAP_LO.wav 9.295731 127.963921 0.0 3965.618408
5 1001_IEO_SAD_LO.wav 7.366590 104.775467 0.0 3980.411133
6 1001_IEO_ANG_MD.wav 9.704618 130.087921 0.0 3993.887939
7 1001_IEO_HAP_MD.wav 11.731209 151.486237 0.0 3990.189697
8 1001_IEO_SAD_MD.wav 10.933606 142.445541 0.0 3992.952393
9 1001_WSI_ANG_XX.wav 15.859626 189.066589 0.0 3992.139404
10 1001_TAI_HAP_XX.wav 11.863870 163.126831 0.0 3988.996338
11 1001_IOM_SAD_XX.wav 11.345793 147.691315 0.0 3993.049805
12 1001_IWW_NEU_XX.wav 8.466592 120.348625 0.0 3962.358643
13 1002_IWW_NEU_XX.wav 5.133738 83.827484 0.0 3992.776123
14 1003_IWW_NEU_XX.wav 7.740101 113.748917 0.0 3991.625244
15 1004_IWW_NEU_XX.wav 6.679007 100.423401 0.0 3920.516602

Insight:

The data frame above, which contains pitch statistics for different emotions and levels, yields several insights, as follows:

  • At the high emotion level, the pitch standard deviation (pitch_std) ranges from 121.977211 to 184.560547. Happiness has a higher mean pitch (pitch_mean) than anger, with sadness in last place.
  • At the low emotion level, pitch_std ranges from 104.775467 to 127.963921. Happiness has a higher pitch_mean than anger, with sadness in last place.
  • At the medium emotion level, pitch_std ranges from 130.087921 to 151.486237. Happiness has a higher pitch_mean than sadness, with anger in last place.
  • At the unspecified emotion level, pitch_std for the three non-neutral emotions (anger, happiness, sadness) ranges from 147.691315 to 189.066589, while for the neutral emotion it ranges from 83.827484 to 120.348625. Anger has a higher pitch_mean than happiness, with sadness in last place; pitch_mean for the neutral emotion ranges from 5.133738 to 8.466592.
  • The maximum pitch (pitch_max) is above 3900 for all emotions and levels.
  • Finally, the unspecified level for each emotion exposes the model to a wider range of pitch statistics (mean, standard deviation, maximum), so the model can predict emotions across varied pitch characteristics.

The pitch values can serve as reference/training data for the machine learning model in this speech emotion recognition project. Hence, by knowing the pitch across emotions and levels, the company can identify whether a customer is angry, happy, sad, or neutral toward its specific questionnaire questions, product, service, or campaign when conducting market research with this machine learning model.
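One caveat worth noting: librosa.piptrack returns a (frequency bins × frames) matrix that is zero everywhere except at detected pitch peaks, so averaging the whole matrix (as the pitch_mean column above does) dilutes the statistic with zeros. A hedged sketch of the alternative, shown on a mock matrix rather than a real recording, is to mask the zeros first:

```python
import numpy as np

# Mock piptrack output: mostly zeros, with a 220 Hz track in a few frames
pitches = np.zeros((1025, 10))
pitches[200, 2:8] = 220.0  # hypothetical detected pitch candidates

naive_mean = pitches.mean()                # diluted by zeros, like pitch_mean above
voiced_mean = pitches[pitches > 0].mean()  # statistic over voiced entries only

print(round(naive_mean, 3), voiced_mean)   # tiny value vs. the true 220.0
```

Either convention can feed the model, as long as it is applied consistently across all files.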


5.1.2 Energy¶

Energy in prosodic feature extraction refers to the strength of the sound signal in each frame. For audio signals, the root-mean-square (RMS) value is calculated for each frame, and the total magnitude of the signal indicates how loud it is. Generally, energy is calculated by squaring each sample in the frame and summing the results. In some cases, the square root of that sum of squares is also taken, producing an energy value in the same unit as amplitude. Energy features are often used to detect aspects such as intensity, emphasis, and rhythm in speech.

In the analysis of prosodic feature extraction for energy, we can obtain some insights or understanding of the characteristics of the analyzed sound, including:

  • Intensity or loudness: The higher the energy value produced, the louder the sound produced by the speaker.

  • Emotion or expression: Energy values can provide an idea of the expression or emotion conveyed by the speaker. For example, speakers who are angry or happy tend to have higher energy values.

  • Physical condition: Significant changes in energy values in a speaker's voice can indicate changes in their physical condition. For example, someone who is sick may have lower energy values in their voice.

  • Diction / Speech style: A person's speech style can be reflected in the energy values of their voice. For example, someone who tends to speak in a monotone or calm intonation may have lower energy values.

In sound processing and prosodic analysis, energy values are often used as features to identify various sound characteristics such as intonation, accent, tempo, and emotion. Therefore, understanding energy values can help in understanding various aspects of human sound and can be used for various applications such as voice recognition, emotion analysis, and natural language processing.
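The frame-wise RMS described above can be written out directly (a sketch with illustrative frame sizes; librosa.feature.rms, used in the cells below, computes the same quantity):

```python
import numpy as np

def rms_energy(signal, frame_size=2048, hop_length=512):
    """Root-mean-square per frame: sqrt(mean(sample^2))."""
    return np.array([
        np.sqrt(np.mean(signal[i:i + frame_size] ** 2))
        for i in range(0, len(signal) - frame_size + 1, hop_length)
    ])

# A loud half second followed by a near-silent half second
sr = 22050
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
signal = np.concatenate([
    0.8  * np.sin(2 * np.pi * 200 * t),   # loud segment
    0.05 * np.sin(2 * np.pi * 200 * t),   # quiet segment
])

rms = rms_energy(signal)
print(rms[0], rms[-1])  # high RMS for the loud frames, low for the quiet ones
```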

A. Anger Emotion Energy¶

In [123]:
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S =  ["1001_IEO_ANG_HI.wav", "1088_IEO_ANG_LO.wav", "1018_IEO_ANG_MD.wav", "1019_MTI_ANG_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)

    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • RMS values for the anger emotion mostly fluctuate and tend to trend upward, over a shorter duration than the happiness emotion.
  • The RMS curve for anger at the low level shows an early spike of high RMS values.
  • The RMS curve for anger at the mid level is consistently high, as this utterance is loud and intense throughout.
  • The RMS curve for anger at the unspecified level has a small peak in RMS energy.

This sample of anger emotion audio can serve as reference/training data for the machine learning model in this speech emotion recognition project. Hence, by knowing the energy (loudness) and characteristics of the anger emotion, the company can identify customers who dislike or are unsatisfied with its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.

B. Happiness Emotion Energy¶

In [127]:
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S =  ["1090_IEO_HAP_HI.wav", "1065_IEO_HAP_LO.wav", "1044_IEO_HAP_MD.wav", "1029_IWL_HAP_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)

    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • RMS values for the happiness emotion mostly form a tight wave early on, lasting longer than in the anger emotion.
  • When the RMS curve for happiness at the high level dips, it tends to bounce back quickly and form an uptrend.
  • The RMS curve for happiness at the low level forms a fast uptrend and downtrend at the beginning, within a short duration.
  • The RMS curve for happiness at the mid level is consistently high, as this utterance is loud and intense throughout.
  • The RMS curve for happiness at the unspecified level tends to fluctuate strongly around the middle of the clip.

This sample of happiness emotion audio can serve as reference/training data for the machine learning model in this speech emotion recognition project. Hence, by knowing the energy (loudness) and characteristics of the happiness emotion, the company can identify customers who love or are satisfied with its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.

C. Sadness Emotion Energy¶

In [126]:
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S =  ["1054_IEO_SAD_HI.wav", "1055_IEO_SAD_LO.wav", "1043_IEO_SAD_MD.wav", "1035_IOM_SAD_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)

    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • RMS values for the sadness emotion mostly fluctuate between low and high energy.
  • The RMS curve for sadness at the high level shows a brief, fairly high peak around the middle of the clip.
  • The RMS curve for sadness at the low level maintains high energy from beginning to end.
  • The RMS curve for sadness at the mid level is consistently high, as this utterance is fairly loud and intense throughout.
  • The RMS curve for sadness at the unspecified level tends to fluctuate and move sideways.

This sample of sadness emotion audio can serve as reference/training data for the machine learning model in this speech emotion recognition project. Hence, by knowing the energy (loudness) and characteristics of the sadness emotion, the company can identify customers who have certain pains or jobs to be done and who are sad about its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.

D. Neutral Emotion Energy¶

In [132]:
FIG_SIZE_S = (10,15)
PATH_S = "data_input/"
files_S =  ["1091_IWW_NEU_XX.wav", "1050_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav", "1001_IWW_NEU_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)

    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    plt.title(f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()

Insight:

The plot above yields several insights, as follows:

  • The neutral samples only have the unspecified emotion level. The plots show varied RMS energy (loudness) and varied characteristics at the beginning, middle, and end of each clip. The unspecified level for each emotion therefore exposes the model to a wider range of RMS energy patterns, so the model can predict emotions across varied characteristics.

This sample of neutral emotion audio can serve as reference/training data for the machine learning model in this speech emotion recognition project. Hence, by knowing the energy (loudness) and characteristics of the neutral emotion, the company can identify customers who feel neutral toward its product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model.


5.2 SPECTRAL¶

5.2.1 Short-time Fourier Transform¶

The Short-time Fourier Transform (STFT) is a signal processing technique that allows us to examine the frequency characteristics of a signal over time. Its main purpose is to analyze signals that exhibit frequency variations over time, such as speech.

By dividing the signal into short, overlapping time segments using a windowing function, the STFT computes the Fourier Transform for each segment, revealing the frequency components present during that specific interval. By performing this analysis on successive time segments, we can track how the frequency content of the signal evolves over time.

In the context of emotion speech recognition classification, the STFT plays a crucial role. It facilitates the capture and analysis of temporal fluctuations in the frequency content of speech signals. This is highly significant because it enables the identification and differentiation of various emotional states conveyed through speech.

The STFT's significance in emotion speech recognition classification projects lies in its ability to effectively capture and analyze the changing frequency characteristics of speech signals over time. This analysis is instrumental in discerning and distinguishing different emotional states expressed in speech.
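As a quick sanity check on the parameters used in this section (n_fft=2048, hop_length=512, and librosa's default 22050 Hz sample rate), the spectrogram shape and frame timings follow from simple arithmetic:

```python
# Shape arithmetic for the STFT parameters used in this section
sr = 22050
n_fft, hop_length = 2048, 512
n_samples = sr  # a 1-second clip, for illustration

n_freq_bins = n_fft // 2 + 1            # one-sided spectrum: 1025 bins
n_frames = 1 + n_samples // hop_length  # with librosa's centered framing

print(n_freq_bins, n_frames)
print(hop_length / sr, "s per hop")     # matches the printed hop duration below
```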

In [351]:
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]



for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S)

    # STFT -> spectrogram
    hop_length = 512  # in num. of samples
    n_fft = 2048  # window in num. of samples

    # calculate duration hop length and window in seconds
    hop_length_duration = float(hop_length) / sr
    n_fft_duration = float(n_fft) / sr

    print("STFT hop length duration is: {} second".format(hop_length_duration))
    print("STFT window duration is: {} second".format(n_fft_duration))

    # perform stft
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

    # calculate absolute values on complex numbers to get magnitude
    spectrogram = np.abs(stft)

    # apply logarithm to cast amplitude to Decibels
    log_spectrogram = librosa.amplitude_to_db(spectrogram)

    # extract name parts from the file name
    name_parts = item.split('_')
    name_parts_selected = name_parts[2:4]  # select parts at index 2 and 3

    # combine selected name parts with underscore separator
    title = '_'.join(name_parts_selected)

    # display spectrogram
    plt.figure(figsize=FIG_SIZE_STFT)

    librosa.display.specshow(log_spectrogram, sr=sr, hop_length=hop_length, x_axis='time')
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar()
    plt.title("Spectrogram - {}".format(title))  # Plot title with the selected file name
    plt.show()
STFT hop length duration is: 0.023219954648526078 second
STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second
STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second
STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second
STFT window duration is: 0.09287981859410431 second
STFT hop length duration is: 0.023219954648526078 second
STFT window duration is: 0.09287981859410431 second

Insight:

  • The amplitude-frequency distribution of the anger emotion at the high level spans a wide time range, with most of the frequency amplitude distributed between 0 and 2 seconds. This indicates dominant frequency components within that range that may be relevant to the expression of anger. This information can help identify and distinguish the anger emotion of customers who dislike or are unsatisfied with the company's product, service, specific campaign, or questionnaire questions when conducting market research with this machine learning model, which uses STFT analysis on audio signals.

  • The amplitude-frequency distribution of the happiness emotion at the high level is fairly even, with the peak amplitude of the dominant frequency distributed between 0 and 1 second. This information can help identify and distinguish the happiness emotion of customers who love or are satisfied with the company's product, service, specific campaign, or questionnaire questions.

  • The amplitude-frequency distribution of the sadness emotion at the high level is weak in the 2 to 2.5 second range. This information can help identify and distinguish the sadness emotion of customers who have certain pains or jobs to be done and who are sad about the company's product, service, specific campaign, or questionnaire questions.

  • The frequency amplitude distribution of the neutral emotion at the unspecified level exhibits greater variety, because neutral audio has varied characteristics, particularly in its frequency amplitude distribution.

So, the highest amplitude levels with a long duration of amplitude density correspond to the anger emotion. The happiness and neutral emotions exhibit relatively low to moderate amplitude levels with density over an intermediate duration. Lastly, amplitude levels within a small range and density over a shorter duration indicate the sadness emotion.


5.2.2 Mel Frequency Cepstral Coefficents (MFCCs)¶

The Mel-Frequency Cepstral Coefficients (MFCCs) are widely utilized in speech recognition and involve converting the power spectrum of a sound into a Mel-scale, which accounts for the human auditory perception. This enables differentiation of voices based on their distinct frequency ranges. MFCCs, which typically consist of a small set of features (usually around 10-20), succinctly describe the overall shape of the spectral envelope and capture the characteristic properties of the human voice.

The MFCC coefficients effectively capture the essential spectral characteristics of the audio signal, highlighting the perceptually significant components. These coefficients can be employed as features for diverse tasks in audio and speech processing, including speech recognition, speaker identification, and emotion recognition. MFCC provides a concise representation of the spectral content of an audio signal, incorporating both frequency and perceptual characteristics. This makes it a potent tool for the analysis and processing of audio signals.

In [ ]:
for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    # Display RMS Energy
    S, phase = librosa.magphase(librosa.stft(y))
    rms = librosa.feature.rms(S=S)

    fig, ax = plt.subplots(figsize=(15, 6), nrows=2, sharex=True)
    times = librosa.times_like(rms)
    ax[0].semilogy(times, rms[0], label='RMS Energy')
    ax[0].set(xticks=[])
    ax[0].legend()
    ax[0].label_outer()
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             y_axis='log', x_axis='time', ax=ax[1])
    ax[1].set(title=f"log Power spectrogram of ({'_'.join(item.split(sep='_')[2:4]).replace('.wav','')})")
    plt.show()
In [355]:
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    x, sr = librosa.load(FILE_PATH_S, sr=44100)
    mfccs = librosa.feature.mfcc(y=x, sr=sr)
    
    # Extract file name
    file_name = item.split("_")[2] + "_" + item.split("_")[3]
    
    # Display the MFCCs with the file name as the plot title
    # Display the MFCCs with the file name as the plot title
    fig, ax = plt.subplots(figsize=(15, 3))
    img = librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=ax)
    fig.colorbar(img, ax=ax)
    ax.set(title=file_name)

    plt.show()

Insight:

  • The MFCC values for the sadness emotion and the neutral emotion (unspecified intensity) appear lower and more dynamic than those for the anger emotion.
  • The MFCC values for the anger emotion appear higher than those for the happiness emotion.
In [387]:
FIG_SIZE_STFT = (8, 5)
PATH_S = "data_input/"
files_S = ["1082_IEO_ANG_HI.wav", "1047_IEO_HAP_HI.wav", "1049_IEO_SAD_HI.wav", "1001_IWW_NEU_XX.wav", "1017_IWW_NEU_XX.wav"]

for item in files_S:
    FILE_PATH_S = PATH_S + item
    # load audio file with Librosa
    y, sr = librosa.load(FILE_PATH_S, sr=44100)
    
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_dB = librosa.power_to_db(S, ref=np.max)
    
    # Extract file name
    file_name = item.split("_")[2] + "_" + item.split("_")[3]
    
    # Display the Mel-frequency spectrogram with the file name as the plot title
    fig, ax = plt.subplots(figsize=(15, 3))
    img = librosa.display.specshow(S_dB, sr=sr, x_axis='time', y_axis='mel', ax=ax)
    fig.colorbar(img, ax=ax, format='%+2.0f dB')
    ax.set(title='Mel-frequency spectrogram' + ' ' + file_name)

    plt.show()
    

Insight:

  • Audio with the anger and happiness emotions exhibits the highest sound intensity, followed by the neutral emotion, and finally the sadness emotion.

  • This analysis is important because the frequency content of a sound provides valuable clues about the emotion expressed in speech. By analyzing the spectrogram, we can identify sound-intensity patterns associated with specific emotions and use them to classify emotions in unseen audio data.


6. References¶

  • Pearson, J., & Nelson, P. (2000). An introduction to human communication: Understanding and sharing (p. 6). Boston, MA: McGraw-Hill.
  • https://github.com/oliviatan29/audio_feature_analysis/tree/main
  • https://towardsdatascience.com/get-to-know-audio-feature-extraction-in-python-a499fdaefe42
  • https://github.com/fafilia/speech-emotions-recognition
  • https://medium.com/epfl-extension-school/age-prediction-of-a-speakers-voice-ae9173ceb322
  • https://github.com/kevbow/Music-Genre-Classification-using-Keras